As is well known, we began experimenting with Muon for large-scale LLM training early on. In particular, in "Muon Sequel: Why We Chose to Experiment with Muon?", we proposed the "Match Adam Update RMS" technique to facilitate quick migration from Adam to Muon, a technique also applied in Kimi K2 training. This technique involves standardizing Muon's Update RMS to 0.2, allowing us to reuse Adam's learning rates and weight decay rates.

Behind this technique lies our observation that Adam's Update RMS is approximately 0.2, and this phenomenon is stable and reproducible. This raises an interesting question: Why is Adam's Update RMS 0.2? Can we explain it theoretically?

Problem Introduction#

First, let's describe the phenomenon: in our experiments, once warmup ends and the model enters formal training, Adam's Update RMS consistently stays between 0.2 and 0.3, and models of different sizes show the same pattern. What these models have in common is that they are all trained with Adam using $\beta_1=0.9,\beta_2=0.95$. A pattern this consistent is unlikely to be a coincidence, which prompted the author to look into the underlying principle.
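For concreteness, here is a minimal sketch of how Update RMS can be measured: it is simply the root mean square of the optimizer's update, and with weight decay disabled the update can be recovered from two parameter snapshots taken around a single step (the helper name and snapshot variables below are only illustrative, not from the original article).

import numpy as np

def update_rms(u):
    # Root mean square of an update tensor: sqrt(mean(u^2)) over all entries.
    u = np.asarray(u, dtype=float)
    return np.sqrt(np.mean(u ** 2))

# Illustrative usage: with weight decay off, one step is theta_new = theta_old - lr * u,
# so the update can be recovered from two parameter snapshots around that step:
#   u = (theta_old - theta_new) / lr
#   print(update_rms(u))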

Now, let's review the form of the Adam optimizer:

(1) \[ \text{Adam}\color{skyblue}{\text{W}}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{u}_t =\hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t \color{skyblue}{ + \lambda_t \boldsymbol{\theta}_{t-1}}) \end{aligned}\right. \]

Note: In this article, all vector multiplication/division, including squaring, defaults to element-wise Hadamard product/quotient.

Our goal is to prove that $\Vert\boldsymbol{u}_t\Vert_{RMS}\approx 0.2$, at least for the parameter setting $\beta_1=0.9,\beta_2=0.95$. We assume $\epsilon$ is sufficiently small to be negligible, and we consider the steady state as $t\to \infty$, where $\beta_1^t$ and $\beta_2^t$ are sufficiently close to zero. Thus, we do not need to distinguish between $\boldsymbol{m}_t$ and $\hat{\boldsymbol{m}}_t$, or between $\boldsymbol{v}_t$ and $\hat{\boldsymbol{v}}_t$, yielding $\boldsymbol{u}_t =\boldsymbol{m}_t/\sqrt{\boldsymbol{v}_t}$.

For $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$, we can obtain the expansions:

(2) \(\boldsymbol{m}_t = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}\boldsymbol{g}_i,\qquad \boldsymbol{v}_t = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\boldsymbol{g}_i^2\)
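As a quick sanity check, the expansions in (2) can be verified numerically against the recursions; a minimal sketch with scalar gradients (for illustration only):

import numpy as np

# Verify that the EMA recursions for m_t and v_t match the closed-form sums in Eq. (2).
rng = np.random.default_rng(0)
beta1, beta2, T = 0.9, 0.95, 50
g = rng.standard_normal(T)  # scalar gradients g_1, ..., g_T

m = v = 0.0
for t in range(T):
    m = beta1 * m + (1 - beta1) * g[t]
    v = beta2 * v + (1 - beta2) * g[t] ** 2

i = np.arange(1, T + 1)
m_closed = (1 - beta1) * np.sum(beta1 ** (T - i) * g)
v_closed = (1 - beta2) * np.sum(beta2 ** (T - i) * g ** 2)
print(np.isclose(m, m_closed), np.isclose(v, v_closed))  # True True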

Numerical Simulation#

If we assume that $\boldsymbol{g}_1,\boldsymbol{g}_2,\cdots,\boldsymbol{g}_t$ are sampled from the same distribution, we can directly use numerical simulation to estimate $\Vert\boldsymbol{u}_t\Vert_{RMS}$. Without further ado, let's test the simplest case, the standard normal distribution $\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$. Reference code:

Python Simulation Code
import numpy as np

# N parameters, T optimizer steps
N, T = 10000, 2000
beta1, beta2 = 0.9, 0.95
m, v = 0, 0
for t in range(1, T + 1):
    g = np.random.randn(N)               # i.i.d. standard normal "gradients"
    m = beta1 * m + (1 - beta1) * g      # first-moment EMA of the gradient
    v = beta2 * v + (1 - beta2) * g**2   # second-moment EMA of the gradient
    u = m / v**0.5                       # Adam update (epsilon and bias correction omitted)

rms = (u**2).mean()**0.5                 # RMS of the final update
print(rms)

What do you think the result is? The answer is approximately 0.225, remarkably close to what we observe in real experiments! This in turn suggests that the simulation assumptions are a good proxy for actual training. Some readers might object: isn't $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ pure noise? How can that possibly match? Actual training is certainly not pure noise, but the match suggests that the signal-to-noise ratio of single-step gradients is so low that pure noise serves as a reasonable stand-in.

Readers can experiment with the above reference code to see which variables influence Update RMS. The general conclusion is: Update RMS is negatively correlated with $\beta_1$, appears unrelated to $\beta_2$, and increases if $\boldsymbol{g}$ has a non-zero mean (equivalent to increasing the gradient signal-to-noise ratio), as the sweep sketch below illustrates.
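A minimal sweep along these lines, still assuming i.i.d. Gaussian gradients (the helper name simulate_update_rms is only illustrative):

import numpy as np

def simulate_update_rms(beta1, beta2, mu=0.0, N=10000, T=2000, seed=0):
    # Steady-state Update RMS under i.i.d. Gaussian gradients with mean mu and unit variance.
    rng = np.random.default_rng(seed)
    m = v = 0.0
    for _ in range(T):
        g = mu + rng.standard_normal(N)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    u = m / np.sqrt(v)
    return np.sqrt(np.mean(u ** 2))

for b1 in (0.8, 0.9, 0.95):
    print("beta1 =", b1, "->", round(simulate_update_rms(b1, 0.95), 3))  # decreases as beta1 grows
for b2 in (0.9, 0.95, 0.99):
    print("beta2 =", b2, "->", round(simulate_update_rms(0.9, b2), 3))   # essentially unchanged
print("mu = 0.3 ->", round(simulate_update_rms(0.9, 0.95, mu=0.3), 3))   # larger than with mu = 0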

Mean-Field Approximation#

In this section, the author attempts a theoretical derivation of an approximate analytical expression for the simulation result above. First, by the definition of RMS, finding $\Vert\boldsymbol{u}_t\Vert_{RMS}$ requires computing $\boldsymbol{u}_t^2 = \boldsymbol{m}_t^2/\boldsymbol{v}_t$. The author's idea is to approximate $\boldsymbol{u}_t^2$ by its expectation, and then apply a further mean-field approximation:

(3) \(\mathbb{E}[\boldsymbol{u}_t^2] = \mathbb{E}\left[\frac{\boldsymbol{m}_t^2}{\boldsymbol{v}_t}\right] \approx \frac{\mathbb{E}[\boldsymbol{m}_t^2]}{\mathbb{E}[\boldsymbol{v}_t]}\)

Some readers might question the validity of the last approximation. The author's suggestion is to proceed regardless, just as with the assumption $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ in the previous section: compute first, and if the result turns out reasonable, the derivation is probably on the right track. Now we compute the numerator and denominator separately. In general, let $\mathbb{E}[\boldsymbol{g}]=\boldsymbol{\mu},\mathbb{E}[\boldsymbol{g}^2]=\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2$. The denominator is simpler:

(4) \[ \begin{aligned} \mathbb{E}[\boldsymbol{v}_t] =&\, (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\mathbb{E}[\boldsymbol{g}_i^2] \\ =&\, (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}(\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\ =&\, (1 - \beta_2^t) (\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\[5pt] \approx &\, \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2 \qquad(t\to\infty) \end{aligned} \]

As for the numerator, we can either expand the square directly or take a shortcut: we need the second moment $\mathbb{E}[\boldsymbol{m}_t^2]$, which equals $\mathbb{E}[\boldsymbol{m}_t]^2 + \mathbb{V}ar[\boldsymbol{m}_t]$. The computation of $\mathbb{E}[\boldsymbol{m}_t]$ is similar to $\mathbb{E}[\boldsymbol{v}_t]$, resulting in $(1 - \beta_1^t)\boldsymbol{\mu}\approx\boldsymbol{\mu}$. For variance, it is additive under independence:

(5) \(\mathbb{V}ar[\boldsymbol{m}_t] = (1 - \beta_1)^2\sum_{i=1}^t \beta_1^{2(t-i)}\boldsymbol{\sigma}^2 = \frac{(1 - \beta_1)^2 (1 - \beta_1^{2t})}{1 - \beta_1^2}\boldsymbol{\sigma}^2\approx \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2\qquad (t\to\infty)\)

Thus:

(6) \(\mathbb{E}[\boldsymbol{u}_t^2]\approx \frac{\boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2}{\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2}\)
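Before analyzing this result, here is a quick numerical check of (6) under the same i.i.d. Gaussian assumption, with all components sharing one $\mu$ and $\sigma$ so that (6) reduces to a scalar (a sketch only):

import numpy as np

# Compare the simulated mean of u_t^2 with the mean-field prediction of Eq. (6).
beta1, beta2, mu, sigma = 0.9, 0.95, 0.2, 1.0
N, T = 10000, 2000
rng = np.random.default_rng(0)

m = v = 0.0
for _ in range(T):
    g = mu + sigma * rng.standard_normal(N)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2

simulated = np.mean(m ** 2 / v)
predicted = (mu ** 2 + (1 - beta1) / (1 + beta1) * sigma ** 2) / (mu ** 2 + sigma ** 2)
print(simulated, predicted)  # the two should be close at this low signal-to-noise ratio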

Result Analysis#

Since $\mathbb{E}[\boldsymbol{u}_t^2]$ is already the element-wise square, estimating $\Vert\boldsymbol{u}_t\Vert_{RMS}$ only requires averaging over the components and taking the square root. For this averaging step, we can apply another mean-field approximation (averaging the numerator and denominator separately), ultimately obtaining:

(7) \(\Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}{\Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2}} = \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}}{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + 1}}\)

It has two influencing factors: first, $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$, which can be viewed as the gradient signal-to-noise ratio (SNR); second, $\beta_1$, one of Adam's hyperparameters. Notably, the result does not depend on $\beta_2$, consistent with earlier simulation results. How accurate is this approximation? Let's consider the simplest special case $\boldsymbol{\mu}=\boldsymbol{0}$:

(8) \(\Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{1 - \beta_1}{1 + \beta_1}}\)

Substituting $\beta_1=0.9$ yields $\sqrt{0.1/1.9}=0.2294\cdots$, which matches both the simulation results and what we observe in practice! Further comparisons with the simulation results are shown below:

[Figure: Simulation Results vs. Mean-Field Approximation (different β₁, β₂)]

Overall, the approximation is quite good, especially when $\beta_2 \geq 0.9$, where results almost coincide with the mean-field approximation (as noted by @EIFY, the paper "Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks" derived the same result).

Comparisons considering SNR are shown below:

[Figure: Simulation Results vs. Mean-Field Approximation (different β₁, SNR)]

When SNR increases, the mean-field approximation error grows but still predicts the overall trend. In practice, gradient SNR rarely approaches 1, so mean-field remains a good approximation.

Reverse Prediction#

If we accept the mean-field approximation (7), we can reverse it to estimate gradient SNR:

(9) \(\frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{\Vert\boldsymbol{u}_t\Vert_{RMS}^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \Vert\boldsymbol{u}_t\Vert_{RMS}^2}\)
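A minimal helper implementing (9) could look as follows (the function name and sample values are only illustrative):

def snr_from_adam_update_rms(update_rms, beta1=0.9):
    # Invert Eq. (9): estimate ||mu||^2 / ||sigma||^2 from a measured Adam Update RMS.
    r2 = update_rms ** 2
    c = (1 - beta1) / (1 + beta1)
    return (r2 - c) / (1 - r2)

print(snr_from_adam_update_rms(0.23))  # ~0, i.e. near-zero gradient SNR
print(snr_from_adam_update_rms(0.4))   # ~0.13, a noticeably higher SNR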

In practical training, $\beta_1$ is given, and $\Vert\boldsymbol{u}_t\Vert_{RMS}$ (Adam's Update RMS) can be directly estimated, making the above equation computable. However, this formula only applies to Adam. Is there a more general estimation approach? Indeed! Recall our earlier estimate:

(10) \(\mathbb{E}[\boldsymbol{m}_t^2]\approx \boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2\)

Summing its components and taking the square root, we consider it an approximation of $\Vert\boldsymbol{m}_t\Vert$:

(11) \(\Vert\boldsymbol{m}_t\Vert\approx \sqrt{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}\)

As for the second moment, $\mathbb{E}[\boldsymbol{v}_t]\approx \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2$. Optimizers like Muon lack second moments, but note that the second moment result is independent of $\beta_2$. Thus, we consider the simplest special case—$\beta_2=0$—where $\boldsymbol{v}_t=\boldsymbol{g}_t^2$. This might be somewhat forced, but estimation often prioritizes convenience. This "approximation" implies $\Vert\boldsymbol{g}_t\Vert^2\approx \Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2$, yielding:

(12) \(\frac{\Vert\boldsymbol{m}_t\Vert}{\Vert\boldsymbol{g}_t\Vert}\approx \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}{\Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2}}\)

The right side resembles Equation (7), so we can write:

(13) \(\frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{\Vert\boldsymbol{m}_t\Vert^2/\Vert\boldsymbol{g}_t\Vert^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \Vert\boldsymbol{m}_t\Vert^2/\Vert\boldsymbol{g}_t\Vert^2}\)

That is, replacing $\Vert\boldsymbol{u}_t\Vert_{RMS}$ with $\Vert\boldsymbol{m}_t\Vert/\Vert\boldsymbol{g}_t\Vert$ gives a general recipe for estimating $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$ in momentum-based optimizers. What about optimizers without momentum? There is no real way around it: $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$ is a statistic across steps, and estimating it requires some statistical information accumulated across steps.
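As a quick check of this recipe on synthetic gradients with a known SNR (a sketch; the helper name and constants are only illustrative):

import numpy as np

def snr_from_momentum(m_norm, g_norm, beta1=0.9):
    # Eq. (13): estimate ||mu||^2 / ||sigma||^2 from the ratio ||m_t|| / ||g_t||.
    r2 = (m_norm / g_norm) ** 2
    c = (1 - beta1) / (1 + beta1)
    return (r2 - c) / (1 - r2)

# Synthetic gradient stream: per-component mean 0.1 and std 1.0,
# so the true ||mu||^2 / ||sigma||^2 is 0.01.
rng = np.random.default_rng(0)
beta1, N, T = 0.9, 100000, 2000
mu, sigma = 0.1, 1.0

m = 0.0
for _ in range(T):
    g = mu + sigma * rng.standard_normal(N)
    m = beta1 * m + (1 - beta1) * g

print(snr_from_momentum(np.linalg.norm(m), np.linalg.norm(g), beta1))  # roughly 0.01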

Summary#

This article explores Adam's Update RMS from both simulation experiments and theoretical approximations, providing theoretical justification for aligning Update RMS to 0.2 in the Muon optimizer.

Citation Information

Original Article: Su Jianlin. Why is Adam's Update RMS 0.2? Scientific Spaces.

How to cite this translation:

Su, J. Why is Adam's Update RMS 0.2? [Translated by Juanxi Tian]. Scientific Spaces.

BibTeX:

@article{su2025adam_update_rms,
  title   = {Why is Adam's Update RMS 0.2?},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/11267},
  note    = {Translated by Juanxi Tian (ScalingOpt Team)}
}